Introduction#
This notebook contains exploratory data analysis, feature engineering and modelling for the M5 Forecasting competition.
Table of Contents
1. Fetch the data
2. Downcasting
3. Melting the data
4. Exploratory Data Analysis
5. Feature Engineering
6. Modelling and Prediction
import os
import pandas as pd
import numpy as np
import plotly_express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import matplotlib.pyplot as plt
import seaborn as sns
import gc
import warnings
warnings.filterwarnings('ignore')
from lightgbm import LGBMRegressor
import joblib
1. Fetch the data#
SALES_PATH = '/ssd003/projects/forecasting_bootcamp/bootcamp_datasets/m5-forecasting-accuracy/sales_train_evaluation.csv'
CALENDAR_PATH = '/ssd003/projects/forecasting_bootcamp/bootcamp_datasets/m5-forecasting-accuracy/calendar.csv'
PRICES_PATH = '/ssd003/projects/forecasting_bootcamp/bootcamp_datasets/m5-forecasting-accuracy/sell_prices.csv'
sales = pd.read_csv(SALES_PATH)
sales.name = 'sales'
calendar = pd.read_csv(CALENDAR_PATH)
calendar.name = 'calendar'
prices = pd.read_csv(PRICES_PATH)
prices.name = 'prices'
Since the validation data is now available for days 1914–1941, we add zero sales for the test days d_1942 – d_1969.
#Add zero sales for the remaining days 1942-1969
for d in range(1942, 1970):
    col = 'd_' + str(d)
    sales[col] = 0
    sales[col] = sales[col].astype(np.int16)
2. Downcasting#
In this section I’ll downcast the dataframes to reduce the amount of storage they use and to expedite the operations performed on them.
**Numerical Columns:** Depending on your environment, pandas automatically creates int32, int64, float32 or float64 columns for numeric data. If you know the min and max values of a column, you can use a smaller subtype that consumes less memory. You can also use an unsigned subtype if there are no negative values.
Here are the different subtypes you can use:
int8 / uint8: consumes 1 byte of memory, range −128 to 127 or 0 to 255
bool: consumes 1 byte, True or False
float16 / int16 / uint16: consumes 2 bytes of memory; int16 ranges from −32768 to 32767, uint16 from 0 to 65535 (float16 covers roughly ±65504 at reduced precision)
float32 / int32 / uint32: consumes 4 bytes of memory; int32 ranges from −2147483648 to 2147483647
float64 / int64 / uint64: consumes 8 bytes of memory
If one of your columns has values between 1 and 10, for example, you can reduce its size from 8 bytes per row to 1 byte — more than an 85% memory saving on that column!

**Categorical Columns:** Pandas stores string columns as objects. One of the reasons this storage is not optimal is that it keeps a pointer to the memory address of each value in the column. For columns with low cardinality (the number of unique values is lower than 50% of the total count), this can be optimized by forcing pandas to use a virtual mapping table where each unique value is mapped to an integer instead of a pointer. This is done using the category datatype.
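A minimal sketch of both ideas on toy data (hypothetical columns, not the M5 frames): downcasting a small-valued integer column and converting a low-cardinality string column to category.

```python
import numpy as np
import pandas as pd

# Toy frame: small integers stored as int64, plus a low-cardinality string column
toy = pd.DataFrame({
    'units': np.random.randint(0, 10, 100_000),   # values fit comfortably in int8
    'store': np.random.choice(['CA_1', 'TX_1'], 100_000),
})
before = toy.memory_usage(deep=True).sum()

toy['units'] = toy['units'].astype(np.int8)       # 8 bytes -> 1 byte per row
toy['store'] = toy['store'].astype('category')    # object pointers -> small integer codes
after = toy.memory_usage(deep=True).sum()
print(f'{before/1024:.0f} KB -> {after/1024:.0f} KB')
```

The category conversion is where most of the saving comes from here, since each string value previously carried its own Python object.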
sales_bd = np.round(sales.memory_usage().sum()/(1024*1024),1)
calendar_bd = np.round(calendar.memory_usage().sum()/(1024*1024),1)
prices_bd = np.round(prices.memory_usage().sum()/(1024*1024),1)
#Downcast in order to save memory
def downcast(df):
    cols = df.dtypes.index.tolist()
    types = df.dtypes.values.tolist()
    for i, t in enumerate(types):
        if 'int' in str(t):
            if df[cols[i]].min() > np.iinfo(np.int8).min and df[cols[i]].max() < np.iinfo(np.int8).max:
                df[cols[i]] = df[cols[i]].astype(np.int8)
            elif df[cols[i]].min() > np.iinfo(np.int16).min and df[cols[i]].max() < np.iinfo(np.int16).max:
                df[cols[i]] = df[cols[i]].astype(np.int16)
            elif df[cols[i]].min() > np.iinfo(np.int32).min and df[cols[i]].max() < np.iinfo(np.int32).max:
                df[cols[i]] = df[cols[i]].astype(np.int32)
            else:
                df[cols[i]] = df[cols[i]].astype(np.int64)
        elif 'float' in str(t):
            if df[cols[i]].min() > np.finfo(np.float16).min and df[cols[i]].max() < np.finfo(np.float16).max:
                df[cols[i]] = df[cols[i]].astype(np.float16)
            elif df[cols[i]].min() > np.finfo(np.float32).min and df[cols[i]].max() < np.finfo(np.float32).max:
                df[cols[i]] = df[cols[i]].astype(np.float32)
            else:
                df[cols[i]] = df[cols[i]].astype(np.float64)
        elif t == object:  # np.object was removed in NumPy 1.24; the builtin works everywhere
            if cols[i] == 'date':
                df[cols[i]] = pd.to_datetime(df[cols[i]], format='%Y-%m-%d')
            else:
                df[cols[i]] = df[cols[i]].astype('category')
    return df
sales = downcast(sales)
prices = downcast(prices)
calendar = downcast(calendar)
sales_ad = np.round(sales.memory_usage().sum()/(1024*1024),1)
calendar_ad = np.round(calendar.memory_usage().sum()/(1024*1024),1)
prices_ad = np.round(prices.memory_usage().sum()/(1024*1024),1)
The plot below shows how much effect downcasting has had on the memory usage of the DataFrames. Clearly, we have been able to reduce sales and prices to less than a quarter of their original memory usage; calendar is already a small dataframe.
dic = {'DataFrame':['sales','calendar','prices'],
'Before downcasting':[sales_bd,calendar_bd,prices_bd],
'After downcasting':[sales_ad,calendar_ad,prices_ad]}
memory = pd.DataFrame(dic)
memory = pd.melt(memory, id_vars='DataFrame', var_name='Status', value_name='Memory (MB)')
memory.sort_values('Memory (MB)',inplace=True)
fig = px.bar(memory, x='DataFrame', y='Memory (MB)', color='Status', barmode='group', text='Memory (MB)')
fig.update_traces(texttemplate='%{text} MB', textposition='outside')
fig.update_layout(template='seaborn', title='Effect of Downcasting')
fig.show()
3. Melting the data#
Currently, the data is spread across three dataframes: sales, prices and calendar. The sales dataframe contains daily sales data with the days (d_1 – d_1969) as columns. The prices dataframe contains item price details, and calendar contains data about the days d.
3.1 Convert from wide to long format
Here's an example of conversion of a wide dataframe to a long dataframe.
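For instance, a minimal sketch on toy data (hypothetical items and days):

```python
import pandas as pd

# Wide frame: one row per item, one column per day
wide = pd.DataFrame({'item_id': ['A', 'B'],
                     'd_1': [3, 0],
                     'd_2': [1, 5]})
long = pd.melt(wide, id_vars='item_id', var_name='d', value_name='sold')
print(long)
#   item_id    d  sold
# 0       A  d_1     3
# 1       B  d_1     0
# 2       A  d_2     1
# 3       B  d_2     5
```

Each day column becomes a set of rows, giving one (item, day, sold) observation per row — exactly the shape the feature engineering below needs.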
Here, the melt function converts the sales dataframe from wide format to long format. I have kept id, item_id, dept_id, cat_id, store_id and state_id as id variables; combined, they have 30490 unique values. The total number of days for which we have data is 1969, so the melted dataframe will have 30490 × 1969 = 60,034,810 rows.
df = pd.melt(sales, id_vars=['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'], var_name='d', value_name='sold').dropna()
3.2 Combine the data
Combine price data from the prices dataframe and day data from the calendar dataframe.
df = pd.merge(df, calendar, on='d', how='left')
df = pd.merge(df, prices, on=['store_id','item_id','wm_yr_wk'], how='left')
Now I have a single, complete dataframe with all the required data. Let's explore!
4. Exploratory Data Analysis#
4.1 The Dataset
The M5 dataset, generously made available by Walmart, involves the unit sales of various products sold in the USA, organized in the form of grouped time series. More specifically, the dataset involves the unit sales of 3,049 products, classified in 3 product categories (Hobbies, Foods, and Household) and 7 product departments, in which the above-mentioned categories are disaggregated. The products are sold across ten stores, located in three States (CA, TX, and WI).
I have drawn an interactive visualization showing the distribution of the 3049 items across different aggregation levels.
group = sales.groupby(['state_id','store_id','cat_id','dept_id'],as_index=False)['item_id'].count().dropna()
group['USA'] = 'United States of America'
group.rename(columns={'state_id':'State','store_id':'Store','cat_id':'Category','dept_id':'Department','item_id':'Count'},inplace=True)
group = group[group["Count"] != 0]
fig = px.treemap(group, path=['USA', 'State', 'Store', 'Category', 'Department'], values='Count',
color='Count',
color_continuous_scale= px.colors.sequential.Sunset,
title='Walmart: Distribution of items')
fig.update_layout(template='seaborn')
fig.show()
4.2 Item Prices
Here I'll study the item prices and their distribution. Note that the prices vary weekly, so to study the distribution of prices I have taken their average.
group_price_store = df.groupby(['state_id','store_id','item_id'],as_index=False)['sell_price'].mean().dropna()
fig = px.violin(group_price_store, x='store_id', color='state_id', y='sell_price',box=True, hover_name='item_id')
fig.update_xaxes(title_text='Store')
fig.update_yaxes(title_text='Selling Price($)')
fig.update_layout(template='seaborn',title='Distribution of Items prices wrt Stores',legend_title_text='State')
fig.show()
Below are some observations from the above plot:
- The distribution of item prices is almost uniform across all the stores in California, Texas and Wisconsin.
- Item HOBBIES_1_361, priced at around 30.5 dollars, is the costliest item sold at Walmarts across California.
- Item HOUSEHOLD_1_060, priced at around 29.875 dollars, is the costliest item sold at Walmarts across Texas.
- Item HOBBIES_1_361, priced at around 30.48 dollars, is the costliest item at TX_1 and TX_3 in Texas, while item HOBBIES_1_255, priced at around 30.5 dollars, is the costliest at TX_2.
group_price_cat = df.groupby(['store_id','cat_id','item_id'],as_index=False)['sell_price'].mean().dropna()
fig = px.violin(group_price_cat, x='store_id', color='cat_id', y='sell_price',box=True, hover_name='item_id')
fig.update_xaxes(title_text='Store')
fig.update_yaxes(title_text='Selling Price($)')
fig.update_layout(template='seaborn',title='Distribution of Items prices wrt Stores across Categories',
legend_title_text='Category')
fig.show()
As can be seen from the plot above, food category items are quite cheap as compared with hobbies and household items. Hobbies and household items have almost the same price range.
4.3 Items Sold
Let's study the sales across all the stores.
group = df.groupby(['year','date','state_id','store_id'], as_index=False)['sold'].sum().dropna()
fig = px.violin(group, x='store_id', color='state_id', y='sold',box=True)
fig.update_xaxes(title_text='Store')
fig.update_yaxes(title_text='Total items sold')
fig.update_layout(template='seaborn',title='Distribution of Items sold wrt Stores',legend_title_text='State')
fig.show()
Below are some observations from the above plot:
- California: CA_3 has sold the most items, while CA_4 has sold the least.
- Texas: TX_2 and TX_3 have sold the most items; TX_1 has sold the least.
- Wisconsin: WI_2 has sold the most items, while WI_3 has sold the least.
- USA: CA_3 has sold the most items overall, while CA_4 has sold the least.
Let’s study the number of items sold over time across all the stores.
fig = go.Figure()
title = 'Items sold over time'
years = group.year.unique().tolist()
buttons = []
y = 3
for state in group.state_id.unique().tolist():
    group_state = group[group['state_id']==state]
    for store in group_state.store_id.unique().tolist():
        group_state_store = group_state[group_state['store_id']==store]
        fig.add_trace(go.Scatter(name=store, x=group_state_store['date'], y=group_state_store['sold'], showlegend=True,
                                 yaxis='y'+str(y) if y!=1 else 'y'))
    y -= 1
fig.update_layout(
xaxis=dict(
#autorange=True,
range = ['2011-01-29','2016-05-22'],
rangeselector=dict(
buttons=list([
dict(count=1,
label="1m",
step="month",
stepmode="backward"),
dict(count=6,
label="6m",
step="month",
stepmode="backward"),
dict(count=1,
label="YTD",
step="year",
stepmode="todate"),
dict(count=1,
label="1y",
step="year",
stepmode="backward"),
dict(count=2,
label="2y",
step="year",
stepmode="backward"),
dict(count=3,
label="3y",
step="year",
stepmode="backward"),
dict(count=4,
label="4y",
step="year",
stepmode="backward"),
dict(step="all")
])
),
rangeslider=dict(
autorange=True,
),
type="date"
),
yaxis=dict(
anchor="x",
autorange=True,
domain=[0, 0.33],
mirror=True,
showline=True,
side="left",
tickfont={"size":10},
tickmode="auto",
ticks="",
title='WI',
titlefont={"size":20},
type="linear",
zeroline=False
),
yaxis2=dict(
anchor="x",
autorange=True,
domain=[0.33, 0.66],
mirror=True,
showline=True,
side="left",
tickfont={"size":10},
tickmode="auto",
ticks="",
title = 'TX',
titlefont={"size":20},
type="linear",
zeroline=False
),
yaxis3=dict(
anchor="x",
autorange=True,
domain=[0.66, 1],
mirror=True,
showline=True,
side="left",
tickfont={"size":10},
tickmode="auto",
ticks='',
title="CA",
titlefont={"size":20},
type="linear",
zeroline=False
)
)
fig.update_layout(template='seaborn', title=title)
fig.show()
4.4 State-wise Analysis
In this section, I will study the sales and revenue of each store individually across the three states: California, Texas and Wisconsin. I have plotted three plots for each store (CA_1, CA_2, CA_3, CA_4, TX_1, TX_2, TX_3, WI_1, WI_2 and WI_3):
- The first plot shows the daily sales of a store, with SNAP days plotted separately. Since SNAP promotes food purchases, I have plotted food sales as well, to check whether it really affects them.
- The second plot shows the daily revenue of a store, again with SNAP days plotted separately.
- The third is a heatmap of daily sales, laid out so that day-of-week patterns are easy to see.
The United States federal government provides a nutrition assistance benefit called the Supplemental Nutrition Assistance Program (SNAP). SNAP provides low-income families and individuals with an Electronic Benefits Transfer debit card to purchase food products. In many states, the monetary benefits are disbursed across 10 days of the month; on each of these days, 1/10 of recipients receive the benefit on their card. More information about the SNAP program can be found [here.](https://www.fns.usda.gov/snap/supplemental-nutrition-assistance-program)
df['revenue'] = df['sold']*df['sell_price'].astype(np.float32)
def introduce_nulls(df):
    idx = pd.date_range(df.date.dt.date.min(), df.date.dt.date.max())
    df = df.set_index('date')
    df = df.reindex(idx)
    df.reset_index(inplace=True)
    df.rename(columns={'index':'date'}, inplace=True)
    return df
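To see what this helper does, here is a toy run (hypothetical data): dates missing from the frame come back as all-NaN rows, which Plotly renders as gaps in the line instead of connecting across them.

```python
import pandas as pd

def introduce_nulls(df):
    # Reindex onto the full daily range; missing dates become all-NaN rows
    idx = pd.date_range(df.date.dt.date.min(), df.date.dt.date.max())
    df = df.set_index('date').reindex(idx).reset_index()
    return df.rename(columns={'index': 'date'})

toy = pd.DataFrame({'date': pd.to_datetime(['2016-01-01', '2016-01-03']),
                    'sold': [10, 7]})
print(introduce_nulls(toy))
#         date  sold
# 0 2016-01-01  10.0
# 1 2016-01-02   NaN
# 2 2016-01-03   7.0
```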
def plot_metric(df, state, store, metric):
    store_sales = df[(df['state_id']==state)&(df['store_id']==store)&(df['date']<='2016-05-22')]
    food_sales = store_sales[store_sales['cat_id']=='FOODS']
    # Select the two aggregated columns with a list of lists (the bare tuple form is deprecated)
    store_sales = store_sales.groupby(['date','snap_'+state], as_index=False)[['sold','revenue']].sum()
    snap_sales = store_sales[store_sales['snap_'+state]==1]
    non_snap_sales = store_sales[store_sales['snap_'+state]==0]
    food_sales = food_sales.groupby(['date','snap_'+state], as_index=False)[['sold','revenue']].sum()
    snap_foods = food_sales[food_sales['snap_'+state]==1]
    non_snap_foods = food_sales[food_sales['snap_'+state]==0]
    non_snap_sales = introduce_nulls(non_snap_sales)
    snap_sales = introduce_nulls(snap_sales)
    non_snap_foods = introduce_nulls(non_snap_foods)
    snap_foods = introduce_nulls(snap_foods)
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=non_snap_sales['date'], y=non_snap_sales[metric],
                             name='Total '+metric+'(Non-SNAP)'))
    fig.add_trace(go.Scatter(x=snap_sales['date'], y=snap_sales[metric],
                             name='Total '+metric+'(SNAP)'))
    fig.add_trace(go.Scatter(x=non_snap_foods['date'], y=non_snap_foods[metric],
                             name='Food '+metric+'(Non-SNAP)'))
    fig.add_trace(go.Scatter(x=snap_foods['date'], y=snap_foods[metric],
                             name='Food '+metric+'(SNAP)'))
    fig.update_yaxes(title_text='Total items sold' if metric=='sold' else 'Total revenue($)')
    fig.update_layout(template='seaborn', title=store)
    fig.update_layout(
        xaxis=dict(
            range=['2011-01-29','2016-05-22'],
            rangeselector=dict(
                buttons=list([
                    dict(count=1, label="1m", step="month", stepmode="backward"),
                    dict(count=6, label="6m", step="month", stepmode="backward"),
                    dict(count=1, label="YTD", step="year", stepmode="todate"),
                    dict(count=1, label="1y", step="year", stepmode="backward"),
                    dict(count=2, label="2y", step="year", stepmode="backward"),
                    dict(count=3, label="3y", step="year", stepmode="backward"),
                    dict(count=4, label="4y", step="year", stepmode="backward"),
                    dict(step="all")
                ])
            ),
            rangeslider=dict(autorange=True),
            type="date"
        ))
    return fig
cal_data = group.copy()
cal_data = cal_data[cal_data.date <= '2016-05-22']
cal_data['week'] = cal_data.date.dt.isocalendar().week  # Series.dt.weekofyear is deprecated
cal_data['day_name'] = cal_data.date.dt.day_name()
def calmap(cal_data, state, store, scale):
    cal_data = cal_data[(cal_data['state_id']==state)&(cal_data['store_id']==store)]
    years = cal_data.year.unique().tolist()
    fig = make_subplots(rows=len(years), cols=1, shared_xaxes=True, vertical_spacing=0.005)
    r = 1
    for year in years:
        data = cal_data[cal_data['year']==year]
        data = introduce_nulls(data)
        fig.add_trace(go.Heatmap(
            z=data.sold,
            x=data.week,
            y=data.day_name,
            hovertext=data.date.dt.date,
            coloraxis="coloraxis", name=year,
        ), r, 1)
        fig.update_yaxes(title_text=year, tickfont=dict(size=5), row=r, col=1)
        r += 1
    fig.update_xaxes(range=[1,53], tickfont=dict(size=10), nticks=53)
    fig.update_layout(coloraxis={'colorscale': scale})
    fig.update_layout(template='seaborn', title=store)
    return fig
CA_1#
fig = plot_metric(df,'CA','CA_1','sold')
fig.show()
fig = plot_metric(df,'CA','CA_1','revenue')
fig.show()
fig = calmap(cal_data, 'CA', 'CA_1', 'magma')
fig.show()
CA_2#
fig = plot_metric(df,'CA','CA_2','sold')
fig.show()
fig = plot_metric(df,'CA','CA_2','revenue')
fig.show()
fig = calmap(cal_data, 'CA', 'CA_2', 'magma')
fig.show()
CA_3#
fig = plot_metric(df,'CA','CA_3','sold')
fig.show()
fig = plot_metric(df,'CA','CA_3','revenue')
fig.show()
fig = calmap(cal_data, 'CA', 'CA_3', 'magma')
fig.show()
CA_4#
fig = plot_metric(df,'CA','CA_4','sold')
fig.show()
fig = plot_metric(df,'CA','CA_4','revenue')
fig.show()
fig = calmap(cal_data, 'CA', 'CA_4', 'magma')
fig.show()
TX_1#
fig = plot_metric(df,'TX','TX_1','sold')
fig.show()
fig = plot_metric(df,'TX','TX_1','revenue')
fig.show()
fig = calmap(cal_data, 'TX', 'TX_1', 'viridis')
fig.show()
TX_2#
fig = plot_metric(df,'TX','TX_2','sold')
fig.show()
fig = plot_metric(df,'TX','TX_2','revenue')
fig.show()
fig = calmap(cal_data, 'TX', 'TX_2', 'viridis')
fig.show()
TX_3#
fig = plot_metric(df,'TX','TX_3','sold')
fig.show()
fig = plot_metric(df,'TX','TX_3','revenue')
fig.show()
fig = calmap(cal_data, 'TX', 'TX_3', 'viridis')
fig.show()
WI_1#
fig = plot_metric(df,'WI','WI_1','sold')
fig.show()
fig = plot_metric(df,'WI','WI_1','revenue')
fig.show()
fig = calmap(cal_data, 'WI', 'WI_1', 'twilight')
fig.show()
WI_2#
fig = plot_metric(df,'WI','WI_2','sold')
fig.show()
fig = plot_metric(df,'WI','WI_2','revenue')
fig.show()
fig = calmap(cal_data, 'WI', 'WI_2', 'twilight')
fig.show()
WI_3#
fig = plot_metric(df,'WI','WI_3','sold')
fig.show()
fig = plot_metric(df,'WI','WI_3','revenue')
fig.show()
fig = calmap(cal_data, 'WI', 'WI_3', 'twilight')
fig.show()
df = df.drop('revenue', axis=1)
5. Feature Engineering#
Time Series data must be re-framed as a supervised learning dataset before we can start using machine learning algorithms.
There is no concept of input and output features in time series. Instead, we must choose the variable to be predicted and use feature engineering to construct all of the inputs that will be used to make predictions for future time steps.
The goal of feature engineering is to provide strong and ideally simple relationships between new input features and the output feature for the supervised learning algorithm to model.
- Remove unwanted data to create space in RAM for further processing.
- Label-encode the categorical features. (I already converted the categorical variables to the category dtype, so I can simply use their codes instead of a LabelEncoder.)
- Remove date as its features are already present.
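As a minimal sketch of this reframing on toy data (a hypothetical series, not the M5 frame): shifting the target creates lagged input columns, turning the series into (X, y) pairs that any regressor can fit.

```python
import pandas as pd

series = pd.DataFrame({'sold': [3, 1, 4, 1, 5, 9]})
# Inputs for each step are the previous two observations of the target
series['lag_1'] = series['sold'].shift(1)
series['lag_2'] = series['sold'].shift(2)
supervised = series.dropna()              # the first rows have no history
X, y = supervised[['lag_1', 'lag_2']], supervised['sold']
print(X.values.tolist())  # [[1.0, 3.0], [4.0, 1.0], [1.0, 4.0], [5.0, 1.0]]
print(y.tolist())         # [4, 1, 5, 9]
```

The lag and window features built below are exactly this idea, applied per item-store group.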
I'm storing the categories corresponding to their respective category codes so that I can use them later when making the submission.
#Store the categories along with their codes
d_id = dict(zip(df.id.cat.codes, df.id))
d_item_id = dict(zip(df.item_id.cat.codes, df.item_id))
d_dept_id = dict(zip(df.dept_id.cat.codes, df.dept_id))
d_cat_id = dict(zip(df.cat_id.cat.codes, df.cat_id))
d_store_id = dict(zip(df.store_id.cat.codes, df.store_id))
d_state_id = dict(zip(df.state_id.cat.codes, df.state_id))
#1
del group, group_price_cat, group_price_store, group_state, group_state_store, cal_data
gc.collect();
#2
df.d = df['d'].apply(lambda x: x.split('_')[1]).astype(np.int16)
cols = df.dtypes.index.tolist()
types = df.dtypes.values.tolist()
for i, t in enumerate(types):
    if t.name == 'category':
        df[cols[i]] = df[cols[i]].cat.codes
#3
df.drop('date',axis=1,inplace=True)
5.2 Introduce Lags
Lag features are the classical way that time series forecasting problems are transformed into supervised learning problems.
Introduce lags on the target variable sold. The maximum lag I have introduced is 36 days; it's entirely up to you how many lags to introduce.
#Introduce lags
lags = [1,2,3,6,12,24,36]
for lag in lags:
    df['sold_lag_'+str(lag)] = df.groupby(['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'],
                                          as_index=False)['sold'].shift(lag).astype(np.float16)
5.3 Mean Encoding
From a mathematical point of view, mean encoding represents the expected value of the target variable conditional on each value of the feature. In a way, it embodies the target variable in its encoded value. I have calculated mean encodings for the following logical feature combinations:
item
state
store
category
department
category & department
store & item
category & item
department & item
state & store
state, store and category
store, category and department
Feel free to add more!
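As a toy illustration (hypothetical data) of what the group-wise `transform('mean')` pattern computes: every row receives the mean of the target over its own group, so the encoded column is aligned with the original frame.

```python
import pandas as pd

toy = pd.DataFrame({'cat_id': ['FOODS', 'FOODS', 'HOBBIES', 'HOBBIES'],
                    'sold':   [2,       4,       1,         1]})
# Each row is encoded with the mean of `sold` within its own category
toy['cat_sold_avg'] = toy.groupby('cat_id')['sold'].transform('mean')
print(toy['cat_sold_avg'].tolist())  # [3.0, 3.0, 1.0, 1.0]
```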
df['iteam_sold_avg'] = df.groupby('item_id')['sold'].transform('mean').astype(np.float16)
df['state_sold_avg'] = df.groupby('state_id')['sold'].transform('mean').astype(np.float16)
df['store_sold_avg'] = df.groupby('store_id')['sold'].transform('mean').astype(np.float16)
df['cat_sold_avg'] = df.groupby('cat_id')['sold'].transform('mean').astype(np.float16)
df['dept_sold_avg'] = df.groupby('dept_id')['sold'].transform('mean').astype(np.float16)
df['cat_dept_sold_avg'] = df.groupby(['cat_id','dept_id'])['sold'].transform('mean').astype(np.float16)
df['store_item_sold_avg'] = df.groupby(['store_id','item_id'])['sold'].transform('mean').astype(np.float16)
df['cat_item_sold_avg'] = df.groupby(['cat_id','item_id'])['sold'].transform('mean').astype(np.float16)
df['dept_item_sold_avg'] = df.groupby(['dept_id','item_id'])['sold'].transform('mean').astype(np.float16)
df['state_store_sold_avg'] = df.groupby(['state_id','store_id'])['sold'].transform('mean').astype(np.float16)
df['state_store_cat_sold_avg'] = df.groupby(['state_id','store_id','cat_id'])['sold'].transform('mean').astype(np.float16)
df['store_cat_dept_sold_avg'] = df.groupby(['store_id','cat_id','dept_id'])['sold'].transform('mean').astype(np.float16)
5.4 Rolling Window Statistics
This is called the rolling window method: a statistic is computed over a fixed-size window of recent observations, and the window slides forward one step at a time, so it is different for every data point.
I’ll be calculating the weekly rolling average of the items sold. More features like rolling min, max or sum can also be calculated, and the same features can be computed for revenue as well.
df['rolling_sold_mean'] = df.groupby(['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'])['sold'].shift(1).transform(lambda x: x.rolling(window=7).mean()).astype(np.float16)
5.5 Expanding Window Statistics
This is an extension of the rolling window technique. With a rolling window, the size of the window is constant and it slides forward in time, so only the most recent values are considered and older values are ignored. An expanding window instead grows to include all past values up to the current point.

I’ll be calculating the expanding average of the items sold. More features like expanding min, max or sum can also be calculated, and the same features can be computed for revenue as well.
df['expanding_sold_mean'] = df.groupby(['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'])['sold'].shift(1).transform(lambda x: x.expanding(2).mean()).astype(np.float16)
5.6 Selling Trend
I will create a selling trend feature: it is positive if an item's daily sales exceed its average over the entire duration (d_1 – d_1969), and negative otherwise. More trend features could be added, but I'll keep it simple with just this one.
df['daily_avg_sold'] = df.groupby(['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id','d'])['sold'].transform('mean').astype(np.float16)
df['avg_sold'] = df.groupby(['id', 'item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'])['sold'].transform('mean').astype(np.float16)
df['selling_trend'] = (df['daily_avg_sold'] - df['avg_sold']).astype(np.float16)
df.drop(['daily_avg_sold','avg_sold'],axis=1,inplace=True)
5.7 Save the data
Now that all the new features are created, let's save the data so that training can be done separately. Also, lags introduce a lot of null values, so I'll remove the data for the first 35 days, as I have introduced lags of up to 36 days.
df = df[df['d']>=36]
Let’s look at our new dataframe.
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 58967660 entries, 1067150 to 60034809
Data columns (total 43 columns):
# Column Dtype
--- ------ -----
0 id int16
1 item_id int16
2 dept_id int8
3 cat_id int8
4 store_id int8
5 state_id int8
6 d int16
7 sold int16
8 wm_yr_wk int16
9 weekday int8
10 wday int8
11 month int8
12 year int16
13 event_name_1 int8
14 event_type_1 int8
15 event_name_2 int8
16 event_type_2 int8
17 snap_CA int8
18 snap_TX int8
19 snap_WI int8
20 sell_price float16
21 sold_lag_1 float16
22 sold_lag_2 float16
23 sold_lag_3 float16
24 sold_lag_6 float16
25 sold_lag_12 float16
26 sold_lag_24 float16
27 sold_lag_36 float16
28 iteam_sold_avg float16
29 state_sold_avg float16
30 store_sold_avg float16
31 cat_sold_avg float16
32 dept_sold_avg float16
33 cat_dept_sold_avg float16
34 store_item_sold_avg float16
35 cat_item_sold_avg float16
36 dept_item_sold_avg float16
37 state_store_sold_avg float16
38 state_store_cat_sold_avg float16
39 store_cat_dept_sold_avg float16
40 rolling_sold_mean float16
41 expanding_sold_mean float16
42 selling_trend float16
dtypes: float16(23), int16(6), int8(14)
memory usage: 4.4 GB
Save the data for training.
df.to_pickle('data.pkl')
del df
gc.collect();
6. Modelling and Prediction#
data = pd.read_pickle('data.pkl')
valid = data[(data['d']>=1914) & (data['d']<1942)][['id','d','sold']]
test = data[data['d']>=1942][['id','d','sold']]
eval_preds = test['sold']
valid_preds = valid['sold']
#Get the store ids
model_list = []
stores = sales.store_id.cat.codes.unique().tolist()
for store in stores:
    df = data[data['store_id']==store]
    #Split the data
    X_train, y_train = df[df['d']<1914].drop('sold', axis=1), df[df['d']<1914]['sold']
    X_valid, y_valid = df[(df['d']>=1914) & (df['d']<1942)].drop('sold', axis=1), df[(df['d']>=1914) & (df['d']<1942)]['sold']
    X_test = df[df['d']>=1942].drop('sold', axis=1)
    #Train and validate
    model = LGBMRegressor(
        n_estimators=1000,
        learning_rate=0.3,
        subsample=0.8,
        colsample_bytree=0.8,
        max_depth=8,
        num_leaves=50,
        min_child_weight=300
    )
    print('*****Prediction for Store: {}*****'.format(d_store_id[store]))
    model.fit(X_train, y_train, eval_set=[(X_train, y_train), (X_valid, y_valid)],
              eval_metric='rmse', verbose=20, early_stopping_rounds=20)
    valid_preds[X_valid.index] = model.predict(X_valid)
    eval_preds[X_test.index] = model.predict(X_test)
    model_list.append(model)
    # Clear memory
    del model, X_train, y_train, X_valid, y_valid
    gc.collect()
*****Prediction for Store: CA_1*****
[20] training's rmse: 0.904045 training's l2: 0.817298 valid_1's rmse: 0.568944 valid_1's l2: 0.323697
[40] training's rmse: 0.889724 training's l2: 0.791609 valid_1's rmse: 0.567389 valid_1's l2: 0.32193
[60] training's rmse: 0.880293 training's l2: 0.774915 valid_1's rmse: 0.564904 valid_1's l2: 0.319116
[80] training's rmse: 0.872382 training's l2: 0.761049 valid_1's rmse: 0.563527 valid_1's l2: 0.317562
*****Prediction for Store: CA_2*****
[20] training's rmse: 0.580716 training's l2: 0.337231 valid_1's rmse: 0.53875 valid_1's l2: 0.290251
[40] training's rmse: 0.56762 training's l2: 0.322193 valid_1's rmse: 0.531685 valid_1's l2: 0.282689
[60] training's rmse: 0.563519 training's l2: 0.317554 valid_1's rmse: 0.52741 valid_1's l2: 0.278161
[80] training's rmse: 0.560069 training's l2: 0.313677 valid_1's rmse: 0.523966 valid_1's l2: 0.274541
[100] training's rmse: 0.55695 training's l2: 0.310193 valid_1's rmse: 0.522462 valid_1's l2: 0.272967
[120] training's rmse: 0.554513 training's l2: 0.307484 valid_1's rmse: 0.521643 valid_1's l2: 0.272111
[140] training's rmse: 0.552004 training's l2: 0.304708 valid_1's rmse: 0.523552 valid_1's l2: 0.274107
*****Prediction for Store: CA_3*****
[20] training's rmse: 1.3662 training's l2: 1.86649 valid_1's rmse: 0.715061 valid_1's l2: 0.511312
[40] training's rmse: 1.33013 training's l2: 1.76924 valid_1's rmse: 0.701512 valid_1's l2: 0.492119
[60] training's rmse: 1.30702 training's l2: 1.70831 valid_1's rmse: 0.69533 valid_1's l2: 0.483484
[80] training's rmse: 1.28812 training's l2: 1.65926 valid_1's rmse: 0.689454 valid_1's l2: 0.475347
[100] training's rmse: 1.2693 training's l2: 1.61113 valid_1's rmse: 0.67785 valid_1's l2: 0.459481
[120] training's rmse: 1.25489 training's l2: 1.57474 valid_1's rmse: 0.681503 valid_1's l2: 0.464446
*****Prediction for Store: CA_4*****
[20] training's rmse: 0.422804 training's l2: 0.178763 valid_1's rmse: 0.348529 valid_1's l2: 0.121472
[40] training's rmse: 0.416184 training's l2: 0.173209 valid_1's rmse: 0.342484 valid_1's l2: 0.117295
[60] training's rmse: 0.41293 training's l2: 0.170511 valid_1's rmse: 0.340811 valid_1's l2: 0.116152
[80] training's rmse: 0.411131 training's l2: 0.169029 valid_1's rmse: 0.340461 valid_1's l2: 0.115914
*****Prediction for Store: TX_1*****
[20] training's rmse: 0.852924 training's l2: 0.727479 valid_1's rmse: 0.52504 valid_1's l2: 0.275667
[40] training's rmse: 0.835764 training's l2: 0.698501 valid_1's rmse: 0.51213 valid_1's l2: 0.262277
[60] training's rmse: 0.827983 training's l2: 0.685556 valid_1's rmse: 0.509391 valid_1's l2: 0.259479
[80] training's rmse: 0.821042 training's l2: 0.674111 valid_1's rmse: 0.507065 valid_1's l2: 0.257115
Plotting feature importances
feature_importance_df = pd.DataFrame()
features = [f for f in data.columns if f != 'sold']
for model, store in zip(model_list, stores):
    store_importance_df = pd.DataFrame()
    store_importance_df["feature"] = features
    store_importance_df["importance"] = model.feature_importances_
    store_importance_df["store"] = store
    feature_importance_df = pd.concat([feature_importance_df, store_importance_df], axis=0)
def display_importances(feature_importance_df_):
    cols = feature_importance_df_[["feature", "importance"]].groupby("feature").mean().sort_values(by="importance", ascending=False)[:20].index
    best_features = feature_importance_df_.loc[feature_importance_df_.feature.isin(cols)]
    plt.figure(figsize=(8, 10))
    sns.barplot(x="importance", y="feature", data=best_features.sort_values(by="importance", ascending=False))
    plt.title('LightGBM Features (averaged over store predictions)')
    plt.tight_layout()
display_importances(feature_importance_df)
Make the submission
If you remember, for EDA, feature engineering and training I had melted the provided data from wide format to long format. Now I have the predictions in long format, but the competition expects submissions in wide format. Therefore, I’ll convert the predictions back to wide format using the pivot function in pandas.
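As a minimal sketch on toy data (hypothetical ids and days), `pd.pivot` spreads the long `d` column into one column per day — the inverse of the earlier melt:

```python
import pandas as pd

long = pd.DataFrame({'id':   ['A', 'A', 'B', 'B'],
                     'd':    [1, 2, 1, 2],
                     'sold': [3, 1, 4, 1]})
# One row per id, one column per distinct value of `d`
wide = pd.pivot(long, index='id', columns='d', values='sold').reset_index()
print(wide[1].tolist(), wide[2].tolist())  # [3, 4] [1, 1]
```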

#Set actual equal to false if you want to top in the public leaderboard :P
actual = False
if actual == False:
    #Get the validation results (we already have them, as less than one month is left for the competition to end)
    validation = sales[['id']+['d_' + str(i) for i in range(1914,1942)]]
    validation['id'] = pd.read_csv('/kaggle/input/m5-forecasting-accuracy/sales_train_validation.csv').id
    validation.columns = ['id'] + ['F' + str(i + 1) for i in range(28)]
else:
    #Get the actual validation results
    valid['sold'] = valid_preds
    validation = valid[['id','d','sold']]
    validation = pd.pivot(validation, index='id', columns='d', values='sold').reset_index()
    validation.columns = ['id'] + ['F' + str(i + 1) for i in range(28)]
    validation.id = validation.id.map(d_id).str.replace('evaluation','validation')
#Get the evaluation results
test['sold'] = eval_preds
evaluation = test[['id','d','sold']]
evaluation = pd.pivot(evaluation, index='id', columns='d', values='sold').reset_index()
evaluation.columns = ['id'] + ['F' + str(i + 1) for i in range(28)]
#Remap the category codes to their respective categories
evaluation.id = evaluation.id.map(d_id)
#Prepare the submission
submit = pd.concat([validation,evaluation]).reset_index(drop=True)
submit.to_csv('submission.csv', index=False)
Resources#
This demo was adapted from the following Kaggle notebook.